Performance of Predictive Models - The Interpretability and Explainability
Authors: Leona Hasani, Leona Hoxha, Nanmanat Disayakamonpan, Nastaran Mesgari
1 Project Overview
1.1 Introduction
Our project considers three different datasets from Kaggle, each from a different industry: the Cardiovascular dataset from the health industry, Weather in Australia from the environmental industry, and Hotel Reservations from the business industry.
The objective of our project is to assess the performance of various supervised learning algorithms in predicting binary target variables. In the next section, we outline the key questions guiding our project, which we will answer throughout the project and in the results and key findings.
Moreover, we aim to examine how the most effective supervised machine learning algorithm learns within a given dataset. To achieve this, we will utilize learning curves, which provide insights into the algorithm’s performance as it processes more training data. Additionally, we will devote significant attention to hyperparameter tuning to optimize model performance. By adjusting these parameters, we seek to identify any potential overfitting issues within the datasets. This analysis will involve visualizations showcasing the training and testing performance metrics across various hyperparameter settings.
The primary goal of this project is to enhance our understanding of supervised predictive models, with particular emphasis on overfitting. Overfitting is a complex concept that can be challenging to grasp, often leading to misconceptions. By delving into this topic, we aim to clarify its nuances and implications within the context of machine learning models. Through thorough examination and visualization of performance metrics, we aim to shed light on the factors contributing to overfitting and strategies for mitigating its effects.
1.2 Questions and Problems
In our project, we delve into a series of questions and challenges aimed at enhancing our model’s performance and interpretability. We prioritize the questions based on their significance and relevance as follows:
1. Can the implementation of more sophisticated modeling methods within our dataset lead to enhanced model performance, and how can we interpret such improvements?
2. If one model performs best on a particular dataset, does that mean it would also perform best on another dataset with the same method?
3. What is the impact of standardization and normalization techniques on the performance scores of our models?
4. Do we have any imbalanced dataset? If yes, what approach could we use to balance the data?
5. How can we analyze the trade-off dynamics between including all available features and employing feature selection techniques?
6. What approach can be employed to identify the optimal hyperparameters of specific models?
7. Is there a risk of overfitting within our datasets, and what measures can be taken to assess and mitigate this risk effectively?
After the preprocessing steps, exploratory data analysis, and modelling, we will answer each of the research questions listed above in the results and conclusion sections.
1.3 Core Methodology and Additional Elements
As our project delves into predictive models and their interpretability, we aim to provide concise explanations of each model. Additionally, we emphasize the importance of exploring additional techniques to enhance model performance and assess the risk of overfitting in our datasets. Therefore, we offer an overview of our core methodology and additional techniques employed in this project.
1.3.1 Resampling (Random Undersampling)
In many fields like healthcare, imbalanced datasets are common, where one class is much more prevalent than others. This can lead to biased models favoring the dominant class (Bach et al., 2019). One approach to address this is resampling, which involves adjusting the dataset to achieve a more balanced distribution through undersampling the majority class, oversampling the minority class, or a hybrid of both (Snieder et al., 2020). Undersampling, where the majority class is reduced, is suitable for our project, given the lower prevalence of heart disease compared to healthy cases. We’ll use an 80:20 undersampling ratio to strike a balance between improving the model’s ability to detect heart disease and maintaining a dataset representative of real-world distributions (Yanminsun et al., 2011).
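As an illustration, here is a minimal random-undersampling sketch using plain numpy rather than a dedicated library such as imbalanced-learn; the function name and the way the 80:20 target split is expressed are our own choices, not the project's exact implementation:

```python
import numpy as np

def random_undersample(X, y, majority_fraction=0.8, seed=0):
    """Undersample the majority class so the resampled class split is
    roughly majority_fraction : (1 - majority_fraction)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    maj, mino = classes[np.argmax(counts)], classes[np.argmin(counts)]
    n_min = counts.min()
    # majority size implied by the target split, capped at what exists
    n_maj = min(counts.max(),
                int(round(n_min * majority_fraction / (1 - majority_fraction))))
    maj_idx = rng.choice(np.flatnonzero(y == maj), size=n_maj, replace=False)
    min_idx = np.flatnonzero(y == mino)
    keep = np.concatenate([maj_idx, min_idx])
    rng.shuffle(keep)
    return X[keep], y[keep]

X = np.arange(1100).reshape(-1, 1)       # toy features
y = np.array([0] * 1000 + [1] * 100)     # imbalanced labels, 10:1
X_res, y_res = random_undersample(X, y)
print(np.bincount(y_res))                # [400 100], i.e. an 80:20 split
```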
1.3.2 Feature Selection: KBest
SelectKBest is a univariate feature selection method: it scores each feature individually against the target variable using a statistical test and keeps the top k features with the highest scores, i.e. those most relevant for predicting the target (Nair & Bhagat, 2019). It therefore helps focus on the task’s most important features, making the dataset more manageable and potentially improving the machine learning model’s performance.
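A minimal usage sketch on synthetic data (the choice of k=5 and the ANOVA F-test scorer are illustrative, not the project's tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-in for one of our datasets
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)
selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 best-scoring features
X_new = selector.fit_transform(X, y)
print(X_new.shape)             # (200, 5)
print(selector.get_support())  # boolean mask of the selected columns
```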
1.3.3 Model Performance Metrics
Due to the characteristics of our target variables in all three datasets, we have to employ classification models; the evaluation metrics below offer a quantitative assessment of how well these models perform (Programmer, 2023).
1.3.3.1 Accuracy
Accuracy is the most used performance metric for evaluating a binary classification model. It measures the proportion of correct predictions made by the model out of all the predictions. A high accuracy score indicates that the model is making a large proportion of correct predictions, while a low accuracy score indicates that the model is making too many incorrect predictions.
Accuracy is calculated using the following formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP represents the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives (Programmer, 2023).
1.3.3.2 Precision
Precision is a metric that measures the proportion of true positives (TP) among the total that are predicted as positive by the model. In other words, precision measures the accuracy of the positive predictions made by the model. A high precision score indicates that the model is able to accurately identify positives, while a low precision score indicates that the model is making too many false positive (FP) predictions.
Precision is calculated using the following formula: Precision = TP / (TP + FP)
where TP is the number of true positives and FP is the number of false positives (Programmer, 2023).
1.3.3.3 Recall
Recall, also known as sensitivity or true positive rate (TPR), is a performance metric that measures the proportion of positives that are correctly identified by the model out of all the actual positives. In other words, recall measures the model’s ability to correctly identify positives. A high recall score indicates that the model is able to identify a large proportion of positives, while a low recall score indicates that the model is missing many positives.
Recall is calculated using the following formula: Recall = TP / (TP + FN)
where TP is the number of true positive instances and FN is the number of false negative instances (Programmer, 2023).
1.3.3.4 F1-score
F1-score is a performance metric that combines precision and recall to provide a comprehensive evaluation of the performance of a binary classification model. It measures the harmonic mean of precision and recall, giving equal importance to both metrics. A high F1-score indicates that the model is performing well in both precision and recall, while a low F1-score indicates that the model is not performing well in either precision or recall (Programmer, 2023).
F1-score is calculated using the following formula: F1-score = 2 * (precision * recall) / (precision + recall)
where precision is the proportion of true positive cases among all the cases predicted as positive, and recall is the proportion of true positive cases among all the actual positive cases.
1.3.3.5 AUC-ROC curve
The ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classifier, indicating the tradeoff between true positive rate (TPR) and false positive rate (FPR) at different thresholds. The AUC represents the area under this curve, which ranges from 0 to 1, with a higher AUC indicating better model performance (Programmer, 2023).
The TPR and FPR are defined as follows:
True Positive Rate (TPR, sensitivity) = TP / (TP + FN)
False Positive Rate (FPR, equal to 1 − specificity) = FP / (FP + TN)
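All five metrics can be computed with scikit-learn; a small worked example (the labels and predicted scores below are made up for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]                    # actual classes
y_pred  = [0, 0, 1, 0, 1, 1, 0, 1]                    # hard class predictions
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.9, 0.4, 0.7]    # predicted P(class = 1)

# here TP = 3, TN = 3, FP = 1, FN = 1
print(accuracy_score(y_true, y_pred))    # (3 + 3) / 8  = 0.75
print(precision_score(y_true, y_pred))   # 3 / (3 + 1)  = 0.75
print(recall_score(y_true, y_pred))      # 3 / (3 + 1)  = 0.75
print(f1_score(y_true, y_pred))          # 0.75
print(roc_auc_score(y_true, y_score))    # 15/16 = 0.9375
```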
1.3.4 Models
In each of the datasets, we’ve applied six different classification algorithms. These algorithms are used to predict outcomes that are either true or false. The goal is to determine which model performs best in terms of accuracy, precision, and other performance measures for each specific dataset. This helps us understand which algorithm is most effective for a given dataset and prediction task. Before proceeding with the analysis, it’s essential to grasp how each model works and is constructed: understanding each model’s mechanics provides insight into how it makes predictions and its underlying assumptions. This comprehension enables us to interpret the results more effectively and choose the most suitable model for our specific dataset and problem. All classification models are taken from the scikit-learn (sklearn) library.
1.3.4.1 Logistic Regression Classifier
Logistic regression predicts the likelihood of an event based on independent variables, making it valuable for classification tasks. By transforming odds into probabilities, it generates predictions bounded between 0 and 1. Coefficients are optimized through maximum likelihood estimation, allowing for efficient prediction (IBM, 2022).
1.3.4.2 Decision Tree Classifier
A decision tree is a type of algorithm used in machine learning for tasks like sorting data into categories or making predictions. It’s like a flowchart, starting with a main question (the root node) and then branching out based on different answers (branches) to eventually reach final conclusions (leaf nodes). It’s designed to divide data into smaller, more manageable groups by making decisions at each step. The goal is to create simple, easy-to-understand rules that accurately predict outcomes. Decision trees can get complex as they grow, so techniques like pruning (removing unnecessary branches) and using ensembles (groups of trees) help keep them accurate and efficient (IBM, 2023).
1.3.4.3 Random Forest Classifier
A random forest is a machine learning algorithm that combines the outputs of multiple decision trees to make predictions. By using a collection of decision trees and injecting randomness into the process, random forests reduce the risk of overfitting and improve accuracy. Each tree in the forest is built on a subset of the data and a subset of features, resulting in a diverse set of trees that work together to provide more accurate predictions (IBM, 2023b).
1.3.4.4 Gradient Boosting Classifier
Gradient boosting is a powerful machine learning technique that combines weak learners, typically decision trees, into a strong predictive model. It operates by sequentially adding trees to correct the errors of the previous ones, using a gradient descent approach to minimize a chosen loss function. This method, marked by its flexibility and ability to handle various types of data, is enhanced through techniques like tree constraints, shrinkage, random sampling, and penalized learning, which mitigate overfitting and enhance predictive accuracy (Jason Brownlee, 2018).
1.3.4.5 KNeighbors Classifier
The K-Nearest Neighbors (KNN) classifier is a type of supervised learning algorithm used for classification tasks. It makes predictions based on the similarity of input data points to the known data points in the training dataset. By creating neighborhoods in the dataset, KNN assigns new data samples to the neighborhood where they best fit. KNN is particularly effective when dealing with numerical data and a small number of features, and it excels in scenarios with less scattered data and few outliers (Alves, 2021).
1.3.4.6 AdaBoost Classifier
AdaBoost, short for Adaptive Boosting, is a powerful ensemble learning algorithm that combines multiple weak classifiers to create a strong predictive model. Its main idea involves iteratively training weak classifiers on different subsets of the training data, assigning higher weights to misclassified samples in each iteration. By focusing on challenging examples, AdaBoost enables subsequent classifiers to improve their performance. The algorithm starts by assigning equal weights to all training examples, then iterates through training weak classifiers, adjusting sample weights and combining classifier predictions based on their performance. This process continues for a specified number of iterations, resulting in a final prediction based on the weighted votes of all weak classifiers (Wizards, 2023).
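To make the comparison of the six classifiers concrete, here is a hedged sketch that fits all of them on synthetic data; the real project uses the preprocessed datasets instead of `make_classification`, and accuracy stands in for the full metric table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {results[name]:.3f}")
```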
1.3.5 Learning Curve
A learning curve is applied to illustrate how well a model performs based on the amount of training data. It helps identify learning issues like underfitting or overfitting and assesses dataset representativeness. By comparing training and validation scores across different training set sizes, learning curves reveal how much the model improves with more data and whether its limitations are due to bias or variance errors (Giola et al, 2021).
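scikit-learn's `learning_curve` utility implements exactly this comparison of training and validation scores across training-set sizes; a minimal sketch on synthetic data (the five size steps and the accuracy scorer are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# train on 10%..100% of the available data, score each size with 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

print(sizes)                          # the actual training-set sizes used
print(train_scores.mean(axis=1))      # mean training score per size
print(val_scores.mean(axis=1))        # mean validation score per size
```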
1.3.6 Overfitting
Overfitting is when a machine learning model is too focused on the training data it’s seen before, so it struggles to make accurate predictions for new data. It’s like a student who memorizes answers but can’t solve new problems (Muralidhar, 2023).
Reasons behind overfitting:
1. Using a complex model for a simple problem, which picks up the noise from the data. Example: fitting a neural network to the Iris dataset.
2. Small datasets, as the training set may not be a representative sample of the underlying population (What Is Overfitting? - Overfitting in Machine Learning Explained - AWS, n.d.).
For example, a model trained to find dogs in outdoor photos might miss dogs indoors because it learned to look for grass.
To spot overfitting, we test the model with more diverse data. One method is called K-fold cross-validation, where we split the training data into subsets and test the model’s performance on each.
To prevent overfitting, we can use strategies like early stopping, where we pause training before the model learns too much noise. Pruning focuses on important features and ignores irrelevant ones (Muralidhar, 2023).
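A small sketch of how such overfitting shows up in practice: an unpruned decision tree on noisy synthetic data reaches perfect training accuracy, and the gap to the test and cross-validation scores is the warning sign (the dataset and model choice here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects label noise, so a perfect training fit must be memorization
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unpruned
print("train acc:", tree.score(X_tr, y_tr))   # 1.0: the tree memorizes the noise
print("test acc: ", tree.score(X_te, y_te))   # noticeably lower
print("5-fold CV:", cross_val_score(tree, X, y, cv=5).mean())
```

Limiting `max_depth` (a simple form of pruning) shrinks this gap at the cost of some training accuracy.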
1.4 Changing the hyperparameters of the models
In this project we will manually change some of the hyperparameters of the best-performing model within each specific dataset. The two hyperparameters are the number of estimators and the maximum depth, for both the Random Forest Classifier and the Gradient Boosting Classifier. Before we delve into the project, we should first understand what these hyperparameters are.
1.4.1 Random Forest Classifier Hyperparameters
Number of estimators - According to (Scikit-learn, 2018) it is “the number of trees in the forest. The default number of estimators is 100”.
Maximum depth - According to (Scikit-learn, 2018) it is “the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. The default maximum depth is None”.
1.4.2 Gradient Boosting Classifier Hyperparameters
Number of estimators - According to (3.2.4.3.5. Sklearn.ensemble.GradientBoostingClassifier — Scikit-Learn 0.20.3 Documentation, 2009) it is “The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance. Values must be in the range 1 to infinity. The default number of estimators is 100”.
Maximum depth - According to (3.2.4.3.5. Sklearn.ensemble.GradientBoostingClassifier — Scikit-Learn 0.20.3 Documentation, 2009) it is “the maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. The default maximum depth is 3”.
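Setting these two hyperparameters explicitly looks as follows (the specific values are illustrative, not the tuned ones used later):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# defaults: n_estimators=100 for both; max_depth=None (RF) and 3 (GB)
rf = RandomForestClassifier(n_estimators=150, max_depth=10, random_state=0)
gb = GradientBoostingClassifier(n_estimators=150, max_depth=5, random_state=0)

print(rf.get_params()["max_depth"])   # 10
print(gb.get_params()["max_depth"])   # 5
```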
2 Business Sector: Hotel Reservation Dataset
2.1 Hotel: Data Overview
The Hotel Reservations Dataset was taken from Kaggle (available from this link: https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset). It contains information on hotel bookings from July 2017 to December 2018 and consists of 36,275 observations, each representing a unique booking. The dataset covers 19 different attributes that provide insights into booking patterns, guest preferences, and hotel operations.
The table below gives the meaning of each variable in this dataset:
| Column Name | Meaning |
|---|---|
| Booking_ID | Unique identifier for each booking |
| no_of_adults | Number of adults included in the booking |
| no_of_children | Number of children included in the booking |
| no_of_weekend_nights | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| no_of_week_nights | Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| type_of_meal_plan | Type of meal plan booked by the guest |
| required_car_parking_space | Indicates whether the guest required a parking space (0 - No, 1- Yes) |
| room_type_reserved | Type of room booked by the guest, ciphered (encoded) by INN Hotels |
| lead_time | Number of days between the booking date and the arrival date |
| arrival_year | Year of the guest’s arrival |
| arrival_month | Month of the guest’s arrival |
| arrival_date | Day of the month the guest arrived |
| market_segment_type | Segment to which the booking belongs, indicating the source or market type of the booking |
| repeated_guest | Indicates whether the guest is a repeated visitor (1 for repeated, 0 for new) |
| no_of_previous_cancellations | Number of previous bookings canceled by the guest |
| no_of_previous_bookings_not_canceled | Number of previous bookings not canceled by the guest |
| avg_price_per_room | Average price per room for the booking |
| no_of_special_requests | Number of special requests made by the guest (e.g. high floor, view from the room, etc) |
| booking_status | Indicates if the booking was canceled or not |
2.2 Hotel: Preprocessing Steps
We can see that Booking_ID (nunique=36275), type_of_meal_plan (nunique=4), room_type_reserved (nunique=7), market_segment_type (nunique=5) and the target variable booking_status (nunique=2) are all object variables. Therefore, for some of them we can apply label encoding to help our machine learning models in the next steps. Moreover, we will delete the first column, Booking_ID, because it has no importance for our analysis.
For now, we want to check if there are any missing values in this data set:
From this, we can see that there are no missing values in this data set, so for now we are not going to remove anything else.
2.2.1 Label Encoding
Now, we want to use Label Encoding for the variables: type_of_meal_plan, room_type_reserved, market_segment_type and booking_status.
Unique Values of type_of_meal_plan:
[1 0 2 3]
Unique Values of room_type_reserved:
[1 4 2 6 5 7 3]
Unique Values of market_segment_type:
[0 1 2 3 4]
Unique Values of booking_status:
[0 1]
Now, all the variables in our dataset are numerical.
However, we have three variables (arrival_year, arrival_month, arrival_date) that together describe the arrival date, so we merge them into a single date column and drop the three original columns.
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | arrival_date_full | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 1 | 0 | 1 | 224 | 0 | 0 | 0 | 0 | 65.00 | 0 | 0 | 2017-10-02 |
| 1 | 2 | 0 | 2 | 3 | 0 | 0 | 1 | 5 | 1 | 0 | 0 | 0 | 106.68 | 1 | 0 | 2018-11-06 |
| 2 | 1 | 0 | 2 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 60.00 | 0 | 1 | 2018-02-28 |
| 3 | 2 | 0 | 0 | 2 | 1 | 0 | 1 | 211 | 1 | 0 | 0 | 0 | 100.00 | 0 | 1 | 2018-05-20 |
| 4 | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 48 | 1 | 0 | 0 | 0 | 94.50 | 0 | 1 | 2018-04-11 |
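The encoding and date-merge steps above can be sketched as follows. This uses a tiny stand-in DataFrame rather than the real dataset; note that `LabelEncoder` assigns integers in alphabetical order of the category labels, so the exact mapping depends on the category names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# tiny stand-in for the hotel dataset
df = pd.DataFrame({
    "type_of_meal_plan": ["Meal Plan 1", "Not Selected", "Meal Plan 2"],
    "booking_status": ["Not_Canceled", "Canceled", "Canceled"],
    "arrival_year": [2017, 2018, 2018],
    "arrival_month": [10, 11, 2],
    "arrival_date": [2, 6, 28],
})

# label-encode the object columns
for col in ["type_of_meal_plan", "booking_status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# merge the three date parts into one datetime column, then drop them
df["arrival_date_full"] = pd.to_datetime(
    df.rename(columns={"arrival_year": "year", "arrival_month": "month",
                       "arrival_date": "day"})[["year", "month", "day"]])
df = df.drop(columns=["arrival_year", "arrival_month", "arrival_date"])
print(df.head())
```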
2.3 Hotel: Exploratory Data Analysis
2.3.1 Check whether the dataset is imbalanced
Now we want to see if our data is balanced or not:
Class Distribution:
0 24390
1 11885
Name: booking_status, dtype: int64
Class Proportions:
0 0.672364
1 0.327636
Name: booking_status, dtype: float64
Imbalance Ratio (Class 1 / Class 0): 0.487289872898729
According to this, with Class Proportions of 0 (Not cancelled) at 67% and 1 (Cancelled) at 33%, it appears that our dataset is not significantly imbalanced. Therefore, we can proceed with our analysis.
2.3.2 Booking Status Over Time
Now, we want to use Plotly Express to visualize the booking status over time (for the years 2017 and 2018). We filter the dataset to isolate records for each respective year and then create line plots to display the trend of booking status over time:
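Under the hood, such a plot needs monthly counts per booking status. A hedged sketch of the aggregation, on a tiny stand-in DataFrame (the actual figure is then drawn with `plotly.express.line` on the full dataset):

```python
import pandas as pd

# tiny stand-in; in the project `df` is the preprocessed hotel dataset
df = pd.DataFrame({
    "arrival_date_full": pd.to_datetime(
        ["2017-10-01", "2017-10-15", "2017-11-03", "2018-01-20", "2018-01-25"]),
    "booking_status": [0, 1, 0, 1, 0],       # 1 = cancelled
})

# resample to month starts: count = total bookings, sum = cancelled bookings
monthly = (df.set_index("arrival_date_full")
             .resample("MS")["booking_status"]
             .agg(["count", "sum"])
             .rename(columns={"count": "total", "sum": "cancelled"}))
monthly["non_cancelled"] = monthly["total"] - monthly["cancelled"]
print(monthly)
# the three columns correspond to the green (total), red (cancelled)
# and blue (non-cancelled) lines of the figure
```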
The green line represents the total bookings, the blue line the non-cancelled bookings, and the red line the cancelled bookings.
As we can see from this visualization, the number of non-cancelled bookings didn’t change much between 2017 and 2018, while the number of cancelled bookings did. The total number of bookings dropped quite significantly from October 2017 and started to recover from January 2018. During this period the number of cancelled bookings was nearly zero, which is why the Total Bookings and Non-Cancelled Bookings lines almost overlap. Moreover, the number of cancelled bookings increased as the total number of bookings increased.
The blue line, “Non-Cancelled Bookings”, can be read as net bookings, since it is calculated by subtracting the cancelled bookings from the total bookings. We observe a significant drop in net bookings from 1,611 on October 1, 2017, to 620 on November 1, 2017. Following this decline, net bookings gradually increased until March 2018 and then remained relatively stable for the subsequent months of 2018, so there do not appear to be any anomalies.
2.3.3 Correlation Heatmap for the Hotel Dataset - numerical features
In this heatmap we can see several variables that are notably correlated with each other. For example, booking_status is moderately positively correlated with lead_time, which means that if a guest books far in advance of the arrival date, the booking is more likely to be cancelled; conversely, last-minute bookings are less likely to be cancelled. room_type_reserved is positively correlated with avg_price_per_room, which is expected: the better the room, the higher the price (suggesting Room 7 is considerably more premium than Room 1). repeated_guest is positively correlated with no_of_previous_bookings_not_canceled, which is also clear, since returning guests tend to have a higher number of previous bookings that were not cancelled, indicating their satisfaction and loyalty. Moreover, guests who have cancelled before tend not to book again.
2.3.4 Boxplot of the numerical features for the Hotel Dataset
This next code shows boxplots to visualize the distribution of numerical variables in the hotel dataset.
After looking at the boxplots: most of them look normal, but one stands out: the average price per room, whose boxplot has an outlier around 500. It comes from a cancelled booking, meaning the guest never stayed at the hotel. We decide to keep it in our data because it’s important for the model to learn from all kinds of situations. Another value we find odd is the number of children in some bookings: a few have 9 or 10 children. This seems implausible for a hotel booking, so we remove those rows from our data.
2.3.5 Histogram for the numerical features for the Hotel Reservations Dataset
To assess the skewness of the numerical features, we plot histograms for each of the variables. If a particular variable is skewed, we can apply a logarithmic transformation to make it approximately normally distributed.
From here, we can see that only the variable lead_time is (positively) skewed. Therefore we apply a log transformation to it.
Now, we want to remove the lead_time variable from the data set and only use the log transformed one for our analysis.
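The transformation can be sketched as follows (stand-in values; using `log1p` here is our assumption, since it handles lead_time = 0 for same-day bookings, whereas a plain log would not):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"lead_time": [224, 5, 1, 211, 48, 0]})  # stand-in values

# log1p computes log(1 + x), so lead_time == 0 maps to 0 instead of -inf
df["lead_time_log"] = np.log1p(df["lead_time"])
df = df.drop(columns=["lead_time"])   # keep only the transformed feature
print(df.head())
```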
2.4 Hotel: Modeling
2.4.1 Modeling Summary
For this part of the project, we explore various machine learning models to predict the hotel booking_status. We start by preprocessing the data, splitting it into training and testing sets with a 0.2 test ratio and standardizing the features using StandardScaler. Next, we use the SelectKBest method to identify the top features for modeling. Then, we train multiple classifiers: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, KNN, and AdaBoost. For each model, we evaluate its performance using metrics such as accuracy, precision, recall, F1-score, and ROC AUC score. Finally, we analyze the results to determine the best-performing model for predicting hotel booking statuses.
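These steps chain naturally into a scikit-learn `Pipeline`; a compact sketch on synthetic data (k=8 and the Random Forest are illustrative choices, not the full model sweep):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),              # standardize features
    ("select", SelectKBest(f_classif, k=8)),  # keep the 8 top features
    ("model", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)
print(roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1]))
```

Fitting the selector and scaler inside the pipeline also guarantees they only ever see the training fold, avoiding leakage into the test set.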
| Model | Accuracy | Precision | Recall | F1-score | ROC AUC Score | Computational Time | |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression Classifier | 0.779738 | 0.683010 | 0.598214 | 0.637806 | 0.732515 | 0.200245 |
| 1 | Logistic Regression Classifier Scaled | 0.778360 | 0.676974 | 0.605017 | 0.638976 | 0.733265 | 0.056284 |
| 2 | Logistic Regression Classifier with Feature Se... | 0.714404 | 0.630354 | 0.287840 | 0.395213 | 0.603435 | 0.165056 |
| 3 | Logistic Regression Classifier with Feature Se... | 0.716885 | 0.623960 | 0.318878 | 0.422060 | 0.613345 | 0.045115 |
| 4 | Decision Tree Classifier | 0.856788 | 0.770499 | 0.795068 | 0.782591 | 0.840732 | 0.140496 |
| 5 | Decision Tree Classifier Scaled | 0.855961 | 0.770376 | 0.791667 | 0.780876 | 0.839235 | 0.143149 |
| 6 | Decision Tree Classifier with Feature Selection | 0.796416 | 0.686091 | 0.685799 | 0.685945 | 0.767640 | 0.112617 |
| 7 | Decision Tree Classifier with Feature Selectio... | 0.797932 | 0.689478 | 0.685374 | 0.687420 | 0.768651 | 0.109633 |
| 8 | Random Forest Classifier | 0.887939 | 0.848348 | 0.796769 | 0.821750 | 0.864222 | 3.957353 |
| 9 | Random Forest Classifier Scaled | 0.887388 | 0.847757 | 0.795493 | 0.820794 | 0.863482 | 3.951000 |
| 10 | Random Forest Classifier with Feature Selection | 0.805100 | 0.697723 | 0.703656 | 0.700677 | 0.778710 | 4.063125 |
| 11 | Random Forest Classifier with Feature Selectio... | 0.808408 | 0.707686 | 0.696854 | 0.702228 | 0.779388 | 4.118062 |
| 12 | Gradient Boosting Classifier | 0.847140 | 0.816285 | 0.681973 | 0.743109 | 0.804172 | 2.638402 |
| 13 | Gradient Boosting Classifier Scaled | 0.847140 | 0.816285 | 0.681973 | 0.743109 | 0.804172 | 2.698068 |
| 14 | Gradient Boosting Classifier with Feature Sele... | 0.775879 | 0.696855 | 0.546344 | 0.612488 | 0.716166 | 1.996979 |
| 15 | Gradient Boosting Classifier with Feature Sele... | 0.775879 | 0.696855 | 0.546344 | 0.612488 | 0.716166 | 1.980496 |
| 16 | KNN Classifier | 0.840524 | 0.764030 | 0.735119 | 0.749296 | 0.813103 | 0.481364 |
| 17 | KNN Classifier Scaled | 0.852653 | 0.779521 | 0.760629 | 0.769959 | 0.828714 | 1.568069 |
| 18 | KNN Classifier with Feature Selection | 0.790903 | 0.676682 | 0.679847 | 0.678261 | 0.762012 | 0.302915 |
| 19 | KNN Classifier with Feature Selection Scaled | 0.795589 | 0.692512 | 0.664541 | 0.678238 | 0.761497 | 0.721915 |
| 20 | AdaBoost Classifier | 0.807030 | 0.718951 | 0.664541 | 0.690676 | 0.769962 | 0.841016 |
| 21 | AdaBoost Classifier Scaled | 0.807030 | 0.718951 | 0.664541 | 0.690676 | 0.769962 | 0.845534 |
| 22 | AdaBoost Classifier with Feature Selection | 0.756720 | 0.674598 | 0.482143 | 0.562361 | 0.685289 | 0.726621 |
| 23 | AdaBoost Classifier with Feature Selection Scaled | 0.756720 | 0.674598 | 0.482143 | 0.562361 | 0.685289 | 0.680749 |
From this final table containing all the results of the models we used, we can observe that the Random Forest Classifier (which performs almost the same as its scaled version) demonstrates the best performance for this dataset. It achieved the highest scores across all measures: accuracy (~89%), meaning the model correctly predicts whether a booking will be cancelled about 89% of the time; precision (~85%), meaning that of all the bookings the model predicted as cancelled, 85% were actually cancelled; recall (~80%), meaning the model correctly identifies 80% of the actual cancelled bookings; F1-score (~82%), meaning the model achieves a good balance of precision and recall; and ROC AUC score (~86%), indicating good discrimination between the positive and negative classes. Although it has a computational time of approximately 4 seconds, this is not significant given its superior performance.
2.4.2 Best Model Performance
For this dataset, the Random Forest Classifier stood out as the best performer across the different metrics like ‘Accuracy’, ‘Precision’, ‘Recall’, ‘F1-score’, ‘ROC AUC Score’, and ‘Computational Time’. So, we’re going to focus more on this model for the next step of our project. We’ll do tasks like Cross-Validation to see if the results of the original version are reliable and consistent, and checking for any signs that our model might be too focused on the training data (over-fitting).
In this section, we’ll rerun the Random Forest Classifier model and compare it with its Cross-Validated version.
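A sketch of how such a cross-validated comparison can be produced with `cross_validate` (synthetic data stands in for the hotel features; the real run uses the preprocessed dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold CV with the same metrics as the summary table
scores = cross_validate(RandomForestClassifier(random_state=0), X, y, cv=5,
                        scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])
for metric in ["test_accuracy", "test_f1", "test_roc_auc"]:
    print(metric, scores[metric].mean())
```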
| Model | Accuracy | Precision | Recall | F1-score | ROC AUC Score | Computational Time | |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.883942 | 0.843324 | 0.788870 | 0.815189 | 0.859238 | 3.856658 |
| 1 | Random Forest Classifier (CV) | 0.883689 | 0.852845 | 0.780902 | 0.816014 | 0.940655 | 16.370657 |
As we can see, the difference between the Random Forest model and its cross-validated version is very small, so we can state that the results of the Random Forest Classifier are reliable enough to continue with our analysis.
2.5 Hotel: Additional Techniques
2.5.1 Learning Curve
In this next code, we check the learning curve of the model as a function of the number of training samples. We show three different learning curves: the first for the original model, the second with max_depth=10 and n_estimators=100, and the third with max_depth=30 and n_estimators=100. We compare how these three learning curves look and check whether the models show signs of overfitting.
The learning curve plot, as shown in Hotel Figure 1, shows how well the model learns as we give it more examples to study. When we train it with 5000 examples, it gets everything right for the training data, showing it can remember all those examples perfectly. On the other hand, for the testing data, it performs the best when we use all 30,000+ samples from the dataset.
This insight suggests that the model could perform even better on the testing data if we added more new samples to the dataset. Moreover, a more varied dataset can lead to better generalization, enabling the model to handle new data more effectively.
2.5.2 Checking for Overfitting
In this step, we want to check for overfitting with the Random Forest Classifier on this dataset using different settings: max depth ranging from 1 to 30, and the number of estimators set at 50, 100, and 150. By trying out different configurations, we can see how the model behaves at different levels of complexity, which helps us find a model that is expressive enough without becoming too specific to the training cases.
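The sweep described above can be sketched as a simple loop over `max_depth` for a fixed number of estimators. This is a hedged sketch on a synthetic stand-in dataset; the real run uses the preprocessed hotel features and repeats the loop for 50, 100, and 150 estimators.

```python
# Sketch of the max_depth sweep used to diagnose overfitting;
# synthetic data stands in for the hotel dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

train_acc, test_acc = [], []
depths = range(1, 31)
for depth in depths:
    rf = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=42)
    rf.fit(X_train, y_train)
    train_acc.append(rf.score(X_train, y_train))
    test_acc.append(rf.score(X_test, y_test))

# Overfitting shows up as a widening train-test gap at larger depths.
gaps = [tr - te for tr, te in zip(train_acc, test_acc)]
print(f"gap at depth 1: {gaps[0]:.3f}, gap at depth 30: {gaps[-1]:.3f}")
```

Plotting `train_acc` and `test_acc` against `depths` gives curves of the kind shown in Hotel Figures 3a) to 3c).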
2.5.2.1 Random Forest Classifier with 50 estimators
2.5.2.2 Random Forest Classifier with 100 estimators
2.5.2.3 Random Forest Classifier with 150 estimators
When exploring different combinations of the Random Forest Classifier’s parameters—specifically, the maximum depth and the number of estimators—we analyzed the results depicted in Hotel Figure 3a), 3b), and 3c). We found that increasing the maximum depth generally improved the model’s performance on the training set, including metrics like accuracy, precision, recall, F1-score, and ROC AUC score. However, this improvement wasn’t as pronounced for the testing set.
The best maximum depth range seemed to be between 10 to 15. In this range, the differences between testing and training scores were smaller compared to maximum depths between 15 to 30.
When considering the number of estimators, there wasn’t much difference between having 50, 100, or 150 estimators. It seems that the number of estimators didn’t have a significant impact on how well the model learned.
In conclusion, for this model, it appears that any number of estimators between 50 to 150 is suitable. However, a maximum depth in the range of 10 to 15 seems to lead to the most balanced performance between the training and testing datasets.
2.6 Hotel: Key Findings
The dataset did not show a severe class imbalance, with approximately 67% of bookings not being canceled and 33% being canceled.
Visualizing the booking status over time revealed interesting trends. While the number of non-canceled bookings remained relatively stable, the number of canceled bookings varied over time. There were significant drops in net bookings during certain periods, followed by gradual recovery.
The correlation heatmap showed several variables that were highly correlated with each other. For example, lead time was positively correlated with booking status, indicating that longer lead times were associated with a higher likelihood of booking cancellation.
Various machine learning models were evaluated, including Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and KNN classifiers. Among these, the Random Forest Classifier demonstrated superior performance as the best-performing model for predicting hotel booking statuses. It achieved high scores across various metrics, including accuracy, precision, recall, F1-score, and ROC AUC score.
Learning curves and overfitting analysis were conducted to ensure the model’s generalization ability. The results indicated that a maximum depth in the range of 10 to 15 led to balanced performance between training and testing datasets.
3 Environment Sector: Weather in Australia Dataset
3.1 Weather: Data Overview
The Weather in Australia Dataset was taken from Kaggle (available from this link: https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package). This dataset contains 145,460 observations and 23 variables, most of them numerical and the rest categorical or of type date.
Below is a table with the meaning of each variable in this dataset:
| Column Name | Meaning |
|---|---|
| Date | The date of the observation |
| Location | The common name of the location of the weather station |
| MinTemp | The minimum temperature in degrees celsius |
| MaxTemp | The maximum temperature in degrees celsius |
| Rainfall | The amount of rainfall recorded for the day in mm |
| Evaporation | The so-called Class A evaporation (mm) in the 24 hours to 9am |
| Sunshine | The number of hours of bright sunshine in the day |
| WindGustDir | The direction of the strongest wind gust in the 24 hours to midnight |
| WindGustSpeed | The speed (km/h) of the strongest wind gust in the 24 hours to midnight |
| WindDir9am | Direction of the wind at 9am |
| WindDir3pm | Direction of the wind at 3pm |
| WindSpeed9am | Speed of the wind 10 min prior to 9am (km/h) |
| WindSpeed3pm | Speed of the wind 10 min prior to 3pm (km/h) |
| Humidity9am | Relative humidity at 9am (percent) |
| Humidity3pm | Relative humidity at 3pm (percent) |
| Pressure9am | Atmospheric pressure at 9am |
| Pressure3pm | Atmospheric pressure at 3pm |
| Cloud9am | Fraction of sky obscured by cloud at 9am (eighths) |
| Cloud3pm | Fraction of sky obscured by cloud at 3pm (eighths) |
| Temp9am | Temperature at 9am (degree Celsius) |
| Temp3pm | Temperature at 3pm (degree Celsius) |
| RainToday | If today is rainy then ‘Yes’, if not then ‘No’ |
| RainTomorrow | Target Variable: If tomorrow is rainy then ‘Yes’, if not then ‘No’ |
3.2 Weather: Preprocessing Steps
We can see that some of the variables are of type object. First, let us check for missing values and, if there are any, decide whether to delete them row-wise or column-wise.
From the heatmap we can see that Evaporation, Sunshine, Cloud9am, and Cloud3pm have more than 50 percent of their observations missing, so we decided to remove these columns, since we do not believe they are key factors in determining whether it will rain tomorrow. Moreover, since the focus of this project is not on comparing imputation techniques across different machine learning algorithms, we simply dropped them.
Then we check for duplicate observations; since they would lead to biased results, we delete them.
After deleting the columns, we check how many missing values remain in each column and delete row-wise. The target variable RainTomorrow has 3,267 missing observations, which we delete; even if we were using imputation techniques, the target column should never be imputed, as that would bias the results.
After these preprocessing steps, our dataset has 112,925 rows and 19 features, which is not a large loss of information.
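The three preprocessing steps above can be sketched in pandas. This is a minimal sketch on a toy frame standing in for the real weather CSV; the column names match the dataset, but the values are invented for illustration.

```python
# Sketch of the weather preprocessing: drop sparse columns,
# drop duplicates, then drop rows with remaining missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Evaporation": [np.nan, np.nan, 1.2, np.nan],
    "Sunshine": [np.nan, 5.0, np.nan, np.nan],
    "Cloud9am": [np.nan] * 4,
    "Cloud3pm": [np.nan] * 4,
    "MinTemp": [10.0, 12.0, 12.0, np.nan],
    "RainTomorrow": ["No", "Yes", "Yes", np.nan],
})

# 1) Drop columns with more than 50% missing values.
sparse_cols = df.columns[df.isnull().mean() > 0.5]
df = df.drop(columns=sparse_cols)

# 2) Drop exact duplicate rows, which would bias the results.
df = df.drop_duplicates()

# 3) Drop rows with any remaining missing values, including rows where
#    the target RainTomorrow is missing (it should never be imputed).
df = df.dropna()

print(df.shape)
```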
3.3 Weather: Exploratory Data Analysis
3.3.1 Check whether the dataset is imbalanced
Now we look at the target variable RainTomorrow, coded 1 when it will rain and 0 when it won't, to see whether there is an imbalance between the two classes.
We can see that the class 0 has 77.84% of the total observations in the dataset and class 1 holds 22.16% of the total observations.
From this we can say that the dataset shows a moderate class imbalance, though not one severe enough that we considered resampling necessary.
3.3.2 Correlation Heatmap for the numerical features
After the preprocessing steps, we want to see how correlated the numerical variables in the dataset are with each other.
We can see that variables measuring the same quantity at different times of day are strongly correlated with each other: if the temperature is high at 9AM, it is expected to be high at 3PM as well, and vice versa.
3.3.3 Boxplot of the numerical features
Next, we look more closely at the distribution of the numerical features to see whether there are any outliers that might affect the overall prediction task.
For Rainfall, which is measured in millimetres, there are many outliers, but even the most extreme value of about 300mm could correspond to a real day of heavy rain in a specific area. For wind speed, outliers up to about 80 km/h are also plausible. Since the outliers make sense for each variable, we do not remove them.
3.3.4 Histogram for the numerical features
In order to see the skewness of the numerical features we need to plot histograms for each of the variables.
We can see that most of the numerical features show a roughly normal distribution; only the three plots in the second row are slightly left-skewed.
3.4 Weather: Modelling
3.4.1 Modeling Summary
Before applying the six classification algorithms, we create four dataframes: (1) the dataset with all variables; (2) the same dataset scaled to mean zero and standard deviation one; (3) the dataset reduced by the SelectKBest algorithm to the ten best of the 19 post-preprocessing features, namely 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp3pm', and 'RainToday'; and (4) those ten selected features scaled to mean zero and standard deviation one. We then check the performance metrics of the six classification algorithms on each.
Moving into the modelling part, we first split the data into training and testing sets with a 0.2 test ratio. The random state is fixed so that the results do not change each time the code runs and end up inconsistent in the project report.
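The split, scaling, and feature-selection steps above can be sketched as follows. This is a hedged sketch on synthetic stand-in data with 19 features, mirroring the four dataset variants described in the text.

```python
# Sketch of building the four dataset variants: all features,
# scaled, top-10 features, and top-10 scaled.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=19,
                           n_informative=10, random_state=42)

# 80/20 split with a fixed random state for reproducible results.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaled variant: zero mean, unit variance (fit on train only,
# to avoid leaking test-set statistics).
scaler = StandardScaler().fit(X_train)
X_train_scaled, X_test_scaled = scaler.transform(X_train), scaler.transform(X_test)

# Feature-selection variant: keep the 10 best features by ANOVA F-score.
selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)
X_train_best, X_test_best = selector.transform(X_train), selector.transform(X_test)

print(X_train_best.shape)
```

Fitting the scaler and selector on the training split only is a deliberate choice: fitting them on the full dataset would leak information from the test set into training.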
| | Model | Accuracy | Precision | Recall | F1-score | ROC AUC Score | Computational Time |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression Classifier | 0.844454 | 0.732572 | 0.481723 | 0.581237 | 0.715468 | 0.892187 |
| 1 | Logistic Regression Classifier Scaled | 0.846757 | 0.732829 | 0.497530 | 0.592680 | 0.722572 | 0.282204 |
| 2 | Logistic Regression Classifier with Feature Se... | 0.847288 | 0.739596 | 0.491602 | 0.590623 | 0.720807 | 0.802855 |
| 3 | Logistic Regression Classifier with Feature Se... | 0.848085 | 0.737471 | 0.500099 | 0.596020 | 0.724342 | 0.286957 |
| 4 | Decision Tree Classifier | 0.789329 | 0.529062 | 0.544952 | 0.536889 | 0.702429 | 1.421627 |
| 5 | Decision Tree Classifier Scaled | 0.788311 | 0.526985 | 0.540209 | 0.533515 | 0.700086 | 1.532786 |
| 6 | Decision Tree Classifier with Feature Selection | 0.783662 | 0.516824 | 0.531120 | 0.523874 | 0.693858 | 0.979467 |
| 7 | Decision Tree Classifier with Feature Selectio... | 0.784414 | 0.518363 | 0.535467 | 0.526776 | 0.695889 | 0.952178 |
| 8 | Random Forest Classifier | 0.856940 | 0.771675 | 0.513535 | 0.616681 | 0.734826 | 24.751709 |
| 9 | Random Forest Classifier Scaled | 0.858446 | 0.775089 | 0.518870 | 0.621612 | 0.737693 | 24.805765 |
| 10 | Random Forest Classifier with Feature Selection | 0.853133 | 0.759833 | 0.503853 | 0.605917 | 0.728929 | 19.058738 |
| 11 | Random Forest Classifier with Feature Selectio... | 0.852557 | 0.755234 | 0.506026 | 0.606010 | 0.729331 | 19.643665 |
| 12 | Gradient Boosting Classifier | 0.852956 | 0.755732 | 0.508002 | 0.607586 | 0.730291 | 21.357315 |
| 13 | Gradient Boosting Classifier Scaled | 0.852646 | 0.753437 | 0.508990 | 0.607547 | 0.730442 | 21.533165 |
| 14 | Gradient Boosting Classifier with Feature Sele... | 0.849369 | 0.745052 | 0.498320 | 0.597206 | 0.724537 | 13.892945 |
| 15 | Gradient Boosting Classifier with Feature Sele... | 0.849369 | 0.743041 | 0.501087 | 0.598537 | 0.725521 | 13.645543 |
| 16 | KNN Classifier | 0.844100 | 0.708559 | 0.516894 | 0.597738 | 0.727746 | 3.558234 |
| 17 | KNN Classifier Scaled | 0.835865 | 0.684872 | 0.495554 | 0.575032 | 0.714851 | 3.281919 |
| 18 | KNN Classifier with Feature Selection | 0.838743 | 0.687253 | 0.514523 | 0.588475 | 0.723451 | 4.266080 |
| 19 | KNN Classifier with Feature Selection Scaled | 0.838831 | 0.686924 | 0.515906 | 0.589258 | 0.724000 | 5.603025 |
| 20 | AdaBoost Classifier | 0.846580 | 0.735676 | 0.492195 | 0.589795 | 0.720561 | 5.064919 |
| 21 | AdaBoost Classifier Scaled | 0.846402 | 0.735642 | 0.491010 | 0.588932 | 0.720025 | 5.170863 |
| 22 | AdaBoost Classifier with Feature Selection | 0.845074 | 0.736667 | 0.480340 | 0.581509 | 0.715375 | 3.629167 |
| 23 | AdaBoost Classifier with Feature Selection Scaled | 0.845163 | 0.737401 | 0.479945 | 0.581448 | 0.715292 | 3.525300 |
After running the six classification algorithms on the four dataset variants (24 model-dataset combinations in total), we can see that the best performing model, comparing accuracy, precision, recall, F1-score, ROC AUC score, and computational time, is the Random Forest Classifier with scaled data (mean 0 and standard deviation 1).
Accuracy: In this case, it is 85.84%, which means that the model correctly predicts whether it will rain or not about 85.84% of the time. In the context of predicting weather conditions, accuracy is crucial as it directly reflects the model’s ability to provide reliable forecasts, which is valuable for making informed decisions, planning activities, and managing resources effectively.
Precision: A precision of 77.51% means that out of all the instances the model predicted as rain, 77.51% of them were actually rain.
Recall: Also known as sensitivity; a score of 51.89% means that the model correctly identifies 51.89% of the actual instances of rain.
F1-score: In this case, the F1-score is 62.16%. A higher F1-score indicates better model performance; an F1-score of 62.16% suggests that the model has achieved a fair balance between precision and recall.
ROC AUC Score: The ROC AUC score, in this case 73.77%, indicates good discrimination between the positive and negative classes.
Computational Time: It took approximately 24.81 seconds for the model to train and make predictions.
3.4.2 Best Model Performance
Now we are curious how the Random Forest with scaled data performs with and without cross-validation: will the results be consistent, or will there be differences?
| | Model | Accuracy | Precision | Recall | F1-score | ROC AUC Score | Computational Time |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.857118 | 0.770183 | 0.516499 | 0.618332 | 0.735994 | 29.336895 |
| 1 | Random Forest Classifier Scaled | 0.858313 | 0.774727 | 0.518475 | 0.621212 | 0.737467 | 28.821689 |
| 2 | Random Forest Classifier with Feature Selection | 0.852956 | 0.754982 | 0.508990 | 0.608049 | 0.730642 | 22.569157 |
| 3 | Random Forest Classifier with Feature Selectio... | 0.852159 | 0.754734 | 0.504051 | 0.604431 | 0.728372 | 21.172029 |
| 4 | Random Forest Classifier Scaled with Cross Val... | 0.856830 | 0.761068 | 0.513028 | 0.612894 | 0.885441 | 117.668163 |
Overall, from the table above we can see that the models have similar accuracy, precision, recall, and F1-score. The cross-validated model demonstrates better discrimination between classes, as evidenced by its higher ROC AUC score; however, this improvement comes at the cost of increased computational time.
When the results of a model with and without cross-validation are almost the same, it means the model is consistent and doesn’t rely heavily on how the data is split for validation. This suggests the model is stable and can generalize well to new data.
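A consistency check of this kind can be sketched with `cross_val_score`. This is a minimal sketch on synthetic stand-in data; the real run uses the scaled weather features.

```python
# Sketch of comparing a single hold-out score with cross-validated
# scores to check the model's stability.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Single train/test split score.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
holdout = rf.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validated scores on the full data.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)

# A small hold-out vs CV difference and a low fold-to-fold spread
# both indicate a stable model that generalizes well.
print(f"holdout={holdout:.3f}, "
      f"cv mean={cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```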
3.5 Weather: Additional Techniques
3.5.1 Learning Curve
Now, for the best performing model, we want to see its learning curve.
The learning curve plot, as shown in Weather Figure 1, reveals that the model achieves a perfect score when trained on 10,000 samples, indicating its ability to memorize the training data entirely. However, the most significant improvement in performance for the testing data occurs when using 30,000 samples. Beyond this point, additional samples result in minimal enhancements in performance. So, when the training accuracy remains relatively stable while the testing accuracy improves slightly with an increase in the number of samples, it indicates that the model is learning to generalize better as more data is provided.
3.5.2 Checking for Overfitting
Now we want to check overfitting with the Random Forest Classifier Scaled Dataset using different settings: max depth ranging from 1 to 20, and the number of estimators set at 50, 100, and 150. By testing various configurations, we aim to understand how the model’s performance changes with different complexities. This helps us identify the optimal balance between model complexity and generalization ability.
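A sweep like this can also be expressed with scikit-learn's `validation_curve` helper, which computes cross-validated train and test scores over a parameter range. This is a hedged sketch on synthetic stand-in data, with a reduced depth range to keep it fast.

```python
# Sketch of the max_depth sweep using validation_curve;
# synthetic data stands in for the scaled weather dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=1500, n_features=10, random_state=42)

depths = [1, 5, 10, 20]
train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, param_name="max_depth", param_range=depths, cv=3, n_jobs=-1,
)

# The train-test gap growing with depth is the overfitting signal
# we look for in the figures.
for d, tr, te in zip(depths, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"max_depth={d}: train={tr:.3f}, test={te:.3f}, gap={tr - te:.3f}")
```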
3.5.2.1 Random Forest Classifier with 50 estimators
3.5.2.2 Random Forest Classifier with 100 estimators
3.5.2.3 Random Forest Classifier with 150 estimators
When exploring different combinations of max depth and number of estimators for the Random Forest Classifier, we observed from Weather Figure 2a), 2b), and 2c), that increasing the max depth generally led to improved performance metrics on the training set, including accuracy, precision, recall, F1-score, and ROC AUC score. However, the performance on the testing dataset showed fluctuations, with some max depths performing better than others. From the plots, it’s evident that the training scores consistently improve with increasing max depth, but the testing scores fluctuate, indicating potential overfitting.
The optimal max depth appears to be in the range of 6 to 8, where the differences in performance metrics between different depths are minimal, suggesting a balance between model complexity and generalization. This range offers good performance on both the training and testing datasets while reducing the risk of overfitting.
Interestingly, varying the number of estimators (50, 100, and 150) in the Random Forest did not significantly change the shape or trend of the learning curves. Despite the different numbers of trees, the overall behavior of the model remained consistent, which suggests that increasing the number of estimators beyond a certain point may not lead to substantial improvements in performance. It is therefore important to consider the trade-off between computational cost and performance when selecting the number of estimators.
3.6 Weather: Key Findings
The Random Forest Classifier with 100 estimators and a maximum depth of 6 exhibits optimal performance for this dataset. Increasing the number of estimators beyond 100 does not significantly enhance model performance, suggesting diminishing returns. Effective preprocessing steps, including missing-value handling and feature selection, contribute to improved model interpretability and performance. The dataset's moderate class imbalance did not prevent robust model training and evaluation.
4 Health Sector: Cardiovascular Dataset
4.1 Cardiovascular: Data Overview
The Cardiovascular Dataset was taken from Kaggle (available from this link: https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset). This involves examining a healthcare dataset with the goal of predicting the reasons behind various diseases, particularly heart disease. This dataset has 308,854 observations and 19 features, including lifestyle factors, personal details, habits, and the presence of different diseases.
In this dataset, there are 12 categorical variables and 7 numerical variables. The feature names and their descriptions are listed below.
| Column Name | Description |
|---|---|
| General_Health | Indicates the general health status of the individual, categorized as ‘Poor’, ‘Very Good’, ‘Good’, ‘Fair’, or ‘Excellent’. |
| Checkup | Indicates the frequency of medical checkups, with options such as ‘Within the past 2 years’, ‘Within the past year’, ‘5 or more years ago’, ‘Within the past 5 years’, or ‘Never’. |
| Exercise | Indicates whether the individual engages in regular exercise, with options ‘Yes’ or ‘No’. |
| Heart_Disease | Indicates the presence or absence of heart disease, with options ‘Yes’ or ‘No’. |
| Skin_Cancer | Indicates the presence or absence of skin cancer, with options ‘Yes’ or ‘No’. |
| Other_Cancer | Indicates the presence or absence of other types of cancer, with options ‘Yes’ or ‘No’. |
| Depression | Indicates whether the individual suffers from depression, with options ‘Yes’ or ‘No’. |
| Diabetes | Indicates the presence or absence of diabetes, with options including ‘Yes’ or ‘No’. |
| Arthritis | Indicates the presence or absence of arthritis, with options ‘Yes’ or ‘No’. |
| Sex | Indicates the gender of the individual, with options ‘Female’ or ‘Male’. |
| Age_Category | Indicates the age category of the individual, such as ‘70-74’, ‘60-64’, ‘75-79’, ‘80+’, etc. |
| Height_(cm) | Indicates the height of the individual in centimeters. |
| Weight_(kg) | Indicates the weight of the individual in kilograms. |
| BMI | Indicates the Body Mass Index (BMI) of the individual. |
| Smoking_History | Indicates the smoking history of the individual, with options ‘Yes’ or ‘No’. |
| Alcohol_Consumption | Indicates the frequency of alcohol consumption, measured in units. |
| Fruit_Consumption | Indicates the frequency of fruit consumption per week, measured in servings. |
| Green_Vegetables_Consumption | Indicates the frequency of green vegetables consumption per week, measured in servings. |
| FriedPotato_Consumption | Indicates the frequency of fried potato consumption per week, measured in servings. |
4.2 Cardiovascular: Exploratory Data Analysis
4.2.1 A Series of Boxplots
For the exploratory data analysis, we start generating a series of boxplots for numerical columns in the “cardio” dataset, representing the distribution of various health and lifestyle variables. These visualizations are helpful for understanding the spread and central tendencies of the data.
The provided data exhibits unusual extremes for height, weight, and BMI, with maximum values of 241 cm, 293 kg, and 99.33 respectively, as well as minimum values of 91 cm and 24 kg. Given that this data was collected from adults, such extremes are uncommon and likely represent outliers. These outliers should be removed during the data cleaning process to ensure the dataset’s integrity for analysis.
4.2.2 A Collection of Histograms
This visualization shows a collection of histograms, each representing the distribution of each numerical variable.
Height (cm): The data appears to be normally distributed, centered around a mean value which looks to be approximately 170 cm.
Weight (kg): Similar to the height distribution, it appears roughly normally distributed, with most values between about 60 and 100 kg.
BMI: Most people in this dataset have a BMI between 25 and 30, which is categorized as overweight. However, a significant number of people fall into the normal (18.5-25) and obese (30-35) categories.
Alcohol Consumption: Most people in this dataset consume very little or no alcohol; the distribution is heavily right-skewed.
Fruit Consumption: This fruit consumption graph shows irregular patterns with multiple peaks, indicating variability in people’s diets.
Green Vegetables Consumption: Similar to fruit consumption, this variable also appears to be multimodal. There are peaks at the lower end of the scale, indicating that a portion of the population consumes green vegetables infrequently.
Fried Potato Consumption: Similar to fruit and green vegetables consumption, this histogram is not normal and is skewed to the right, with a large number of individuals consuming fried potatoes infrequently, and a few consuming them very frequently.
4.2.3 Target Variable: Heart_Disease
Next, we generated a histogram to show a comparison of individuals with and without heart disease.
As the results show, the "No" bar is significantly higher than the "Yes" bar (283,883 versus 24,971), indicating that a much larger number of individuals in the sample do not have heart disease.
4.2.4 Correlation Matrix
After that, we copied the original data into a new dataframe and converted the categorical columns into numerical codes in order to compute a correlation matrix. The table below shows the first rows of the encoded dataframe.
| | General_Health | Checkup | Exercise | Heart_Disease | Skin_Cancer | Other_Cancer | Depression | Diabetes | Arthritis | Sex | Age_Category | Height_(cm) | Weight_(kg) | BMI | Smoking_History | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 10 | 150.0 | 32.66 | 14.54 | 1 | 0.0 | 30.0 | 16.0 | 12.0 |
| 1 | 4 | 4 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 10 | 165.0 | 77.11 | 28.29 | 0 | 0.0 | 30.0 | 0.0 | 4.0 |
| 2 | 4 | 4 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 8 | 163.0 | 88.45 | 33.47 | 0 | 4.0 | 12.0 | 3.0 | 16.0 |
| 3 | 3 | 4 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | 11 | 180.0 | 93.44 | 28.73 | 0 | 0.0 | 30.0 | 30.0 | 8.0 |
| 4 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 12 | 191.0 | 88.45 | 24.37 | 1 | 0.0 | 8.0 | 4.0 | 0.0 |
This heatmap is useful for quickly identifying potential relationships between health-related factors.
As our target variable is the presence of heart disease (Heart_Disease), we plot the bar chart to show the correlation coefficients of various factors with heart disease.
As shown, the factor most strongly correlated with heart disease is “Age_Category (0.23)”, followed by “Diabetes (0.17)”, “Arthritis (0.15)”, and “Smoking_History (0.11)”. Other factors like “Sex (0.07)”, “BMI (0.04)”, “Depression (0.03)”, and dietary habits have lower correlations. “Exercise (-0.10)” and “Alcohol Consumption (-0.04)” have the least correlation. Essentially, the chart identifies age, diabetes, arthritis, and smoking as having stronger associations with heart disease in the studied population.
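The feature ranking behind this bar chart can be sketched directly from the encoded dataframe. This is a minimal sketch; the toy frame below (with invented values) stands in for the real `cardio_encoded` dataframe.

```python
# Sketch of ranking features by their Pearson correlation with the
# target Heart_Disease; toy values stand in for cardio_encoded.
import pandas as pd

cardio_encoded = pd.DataFrame({
    "Age_Category":  [10, 10, 8, 11, 12, 3, 2, 1],
    "Diabetes":      [0, 2, 2, 2, 0, 0, 0, 0],
    "Exercise":      [0, 0, 1, 1, 0, 1, 1, 1],
    "Heart_Disease": [0, 1, 0, 1, 0, 0, 0, 0],
})

# Correlation of every column with the target, sorted by absolute strength.
corr = (cardio_encoded.corr()["Heart_Disease"]
        .drop("Heart_Disease")
        .sort_values(key=abs, ascending=False))
print(corr.round(2))
```

Plotting `corr` as a horizontal bar chart reproduces the figure described above.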
4.3 Cardiovascular: Preprocessing Steps
4.3.1 Identifying and Handling Outliers
Before identifying and handling outliers, we checked for missing data and for duplicate values. There are no missing values in any column of the dataset; however, we found 80 duplicated rows, which were subsequently removed. Duplicate entries can occur when a person submits the same values more than once, intentionally or not. After this, we checked the number of unique values in each column.
As the boxplots showed some unusual extremes, we remove outliers from the Height, Weight, and BMI attributes in this step. However, we retain outliers in the Alcohol, Fruit, Green Vegetables, and Fried Potato consumption attributes, since their accuracy is uncertain.
To mitigate the influence of extreme cases on the results, 1,955 rows containing outliers were excluded. As a result, our dataset still contains 306,899 observations.
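The text does not specify the exact outlier rule used, so the sketch below assumes the common 1.5*IQR fence on the three checked columns; the toy values (including the extremes mentioned earlier, such as 91 cm and 241 cm) are stand-ins for the real cardio data.

```python
# Hedged sketch of outlier removal on Height, Weight, and BMI,
# assuming 1.5*IQR fences (the report's exact rule is unspecified).
import pandas as pd

cardio = pd.DataFrame({
    "Height_(cm)": [170, 165, 180, 91, 241, 175],
    "Weight_(kg)": [70, 60, 85, 24, 293, 80],
    "BMI": [24.2, 22.0, 26.2, 29.0, 50.4, 26.1],
})

mask = pd.Series(True, index=cardio.index)
for col in ["Height_(cm)", "Weight_(kg)", "BMI"]:
    q1, q3 = cardio[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Keep only rows inside the fences for every checked column.
    mask &= cardio[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

cardio_clean = cardio[mask]
print(cardio_clean.shape)
```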
4.3.2 Training and Testing Split
Before training and testing split, we performed label encoding on a copy of the original dataframe named “cardio_encoded”. After label encoding process, we checked the data types of variables in the “cardio_encoded” DataFrame to ensure that there are no string-format variables.
Then, we split the dataset into training and testing sets and standardized the feature variables, preparing them for machine learning modeling. Finally, the scaled data is converted back into DataFrame format for further analysis.
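The label-encoding step above can be sketched as follows. This is a minimal sketch on a toy frame with a few of the dataset's columns; the real `cardio_encoded` frame is built the same way from a copy of the full data.

```python
# Sketch of label-encoding the categorical columns of a copy
# of the cardio dataframe.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cardio = pd.DataFrame({
    "General_Health": ["Poor", "Good", "Excellent", "Good"],
    "Exercise": ["Yes", "No", "Yes", "Yes"],
    "BMI": [24.5, 31.0, 22.1, 27.8],
})

cardio_encoded = cardio.copy()
for col in cardio_encoded.select_dtypes(include="object").columns:
    # Map each category to an integer code, column by column.
    cardio_encoded[col] = LabelEncoder().fit_transform(cardio_encoded[col])

# Every column should now be numeric (no string-format variables left).
print(cardio_encoded.dtypes)
```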
4.3.3 Feature Selection: KBest
This step involves feature selection using the SelectKBest method with a k value of 10. The target variable is set to ‘Heart_Disease’. The best selected features include ‘Checkup’, ‘Exercise’, ‘Heart_Disease’, ‘Skin_Cancer’, ‘Depression’, ‘Diabetes’, ‘Arthritis’, ‘Sex’, ‘Height_(cm)’, and ‘BMI’. Finally, the selected features are applied to both the training and testing sets, as well as their scaled versions, by dropping the non-selected features from the datasets.
4.3.4 Improving Class Imbalance by Resampling (Undersampling)
In this step, we calculated the imbalance ratio of the target variable “Heart_Disease” in the dataset.
As shown in Cardio Figure 1, this ratio highlights a class imbalance: individuals with heart disease are underrepresented compared to those without.
Then, we addressed the class imbalance through undersampling of the majority class. The dataset is first divided into the majority (no heart disease) and minority (heart disease present) classes; a portion of the majority class is then randomly downsampled to reduce the imbalance to an 80:20 ratio between the majority and minority classes.
Now, the majority class 0 (not having heart disease) and the minority class 1 (having heart disease) have 99,204 and 24,801 observations, respectively, as you can see in the Cardio Figure 2.
After undersampling, the dataset is split into training and testing sets, followed by standardization of the features. Finally, the scaled data is converted back to DataFrames for further analysis.
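The undersampling step can be sketched with `sklearn.utils.resample`. This is a hedged sketch on synthetic labels with a similar imbalance to the real data; the real run operates on the cardio dataframe.

```python
# Sketch of downsampling the majority class to an 80:20 ratio.
import pandas as pd
from sklearn.utils import resample

# Toy frame: 920 negatives, 80 positives (roughly the text's imbalance).
df = pd.DataFrame({"Heart_Disease": [0] * 920 + [1] * 80, "x": range(1000)})

majority = df[df["Heart_Disease"] == 0]
minority = df[df["Heart_Disease"] == 1]

# Keep the minority class whole; sample the majority down to 4x its size,
# giving an 80:20 majority:minority ratio.
majority_down = resample(majority, replace=False,
                         n_samples=4 * len(minority), random_state=42)

balanced = pd.concat([majority_down, minority])
print(balanced["Heart_Disease"].value_counts())
```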
4.4 Cardiovascular: Modeling
4.4.1 Modeling Summary
| | Model | Accuracy | Precision | Recall | F1-score | ROC AUC Score | Computational Time |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression with All Variables | 0.918899 | 0.422727 | 0.018811 | 0.036019 | 0.508280 | 1.467543 |
| 1 | Logistic Regression with Scaled Data | 0.918964 | 0.447552 | 0.025890 | 0.048948 | 0.511545 | 0.302805 |
| 2 | Logistic Regression with Feature Selection | 0.919029 | 0.400000 | 0.010518 | 0.020497 | 0.504568 | 1.168535 |
| 3 | Logistic Regression with Scaled Feature Selection | 0.919094 | 0.412698 | 0.010518 | 0.020513 | 0.504603 | 0.199655 |
| 4 | Logistic Regression with Resampling | 0.814080 | 0.580083 | 0.254839 | 0.354111 | 0.604361 | 0.671731 |
| 5 | Logistic Regression with Scaled Resampling | 0.814282 | 0.579443 | 0.260282 | 0.359210 | 0.606528 | 0.107246 |
| 6 | Decision Tree with All Variables | 0.862447 | 0.195792 | 0.227751 | 0.210566 | 0.572900 | 1.181523 |
| 7 | Decision Tree with Scaled Data | 0.862447 | 0.195792 | 0.227751 | 0.210566 | 0.572900 | 1.160869 |
| 8 | Decision Tree with Feature Selection | 0.900749 | 0.210977 | 0.084749 | 0.120924 | 0.528492 | 0.386444 |
| 9 | Decision Tree with Scaled Feature Selection | 0.900929 | 0.211568 | 0.084345 | 0.120607 | 0.528405 | 0.392200 |
| 10 | Decision Tree with Resampling | 0.749083 | 0.382161 | 0.412903 | 0.396938 | 0.623013 | 0.465445 |
| 11 | Decision Tree with Scaled Resampling | 0.749163 | 0.382039 | 0.411694 | 0.396312 | 0.622610 | 0.459432 |
| 12 | Random Forest with All Variables | 0.918638 | 0.434896 | 0.033778 | 0.062688 | 0.514967 | 18.796389 |
| 13 | Random Forest with Scaled Data | 0.918638 | 0.430556 | 0.031351 | 0.058446 | 0.513859 | 19.802505 |
| 14 | Random Forest with Feature Selection | 0.903486 | 0.220639 | 0.078277 | 0.115557 | 0.527027 | 12.010410 |
| 15 | Random Forest with Scaled Feature Selection | 0.903698 | 0.217747 | 0.075445 | 0.112062 | 0.525851 | 11.881666 |
| 16 | Random Forest with Resampling | 0.818394 | 0.589272 | 0.303427 | 0.400586 | 0.625279 | 7.939783 |
| 17 | Random Forest with Scaled Resampling | 0.817265 | 0.581930 | 0.306452 | 0.401479 | 0.625707 | 7.918733 |
| 18 | Gradient Boosting with All Variables | 0.919436 | 0.498660 | 0.037621 | 0.069964 | 0.517154 | 19.082682 |
| 19 | Gradient Boosting with Scaled Data | 0.919436 | 0.498660 | 0.037621 | 0.069964 | 0.517154 | 18.692237 |
| 20 | Gradient Boosting with Feature Selection | 0.919306 | 0.285714 | 0.001214 | 0.002417 | 0.500474 | 8.275284 |
| 21 | Gradient Boosting with Scaled Feature Selection | 0.919306 | 0.285714 | 0.001214 | 0.002417 | 0.500474 | 8.293972 |
| 22 | Gradient Boosting with Resampling | 0.823757 | 0.607130 | 0.336492 | 0.433000 | 0.641030 | 7.615644 |
| 23 | Gradient Boosting with Scaled Resampling | 0.823757 | 0.607130 | 0.336492 | 0.433000 | 0.641030 | 7.552372 |
| 24 | KNeighbors with All Variables | 0.911470 | 0.201946 | 0.033576 | 0.057579 | 0.510976 | 7.912075 |
| 25 | KNeighbors with Scaled Data | 0.909433 | 0.296761 | 0.090817 | 0.139074 | 0.535982 | 7.341052 |
| 26 | KNeighbors with Feature Selection | 0.911258 | 0.243629 | 0.048341 | 0.080675 | 0.517597 | 1.523488 |
| 27 | KNeighbors with Scaled Feature Selection | 0.911030 | 0.251681 | 0.052994 | 0.087552 | 0.519595 | 4.721148 |
| 28 | KNeighbors with Resampling | 0.777509 | 0.390416 | 0.200403 | 0.264855 | 0.561091 | 1.297843 |
| 29 | AdaBoost with All Variables | 0.918785 | 0.466882 | 0.058455 | 0.103901 | 0.526304 | 4.393100 |
| 30 | AdaBoost with Scaled Data | 0.918785 | 0.466882 | 0.058455 | 0.103901 | 0.526304 | 4.312562 |
| 31 | AdaBoost with Feature Selection | 0.919225 | 0.455696 | 0.014563 | 0.028224 | 0.506520 | 2.403135 |
| 32 | AdaBoost with Scaled Feature Selection | 0.919225 | 0.455696 | 0.014563 | 0.028224 | 0.506520 | 2.410494 |
| 33 | AdaBoost with Resampling | 0.822628 | 0.604469 | 0.327218 | 0.424591 | 0.636846 | 1.765493 |
| 34 | AdaBoost with Scaled Resampling | 0.822628 | 0.604469 | 0.327218 | 0.424591 | 0.636846 | 1.767104 |
Based on these results, we can summarize the best-performing model for each metric as follows:
Best Accuracy: The highest accuracy (91.94%) is achieved by Gradient Boosting with all variables, both with and without scaling. These models are correct in about 91.94% of their predictions.
Best Precision: The models with the highest precision are Gradient Boosting with Resampling and Gradient Boosting with Scaled Resampling. When these models predict heart disease, they are correct 60.71% of the time.
Best Recall: The model with the highest recall is Decision Tree with Resampling, which correctly identifies about 41.29% of all true cases of heart disease.
Best F1-score: Gradient Boosting with Resampling and Gradient Boosting with Scaled Resampling have the highest F1-scores (43.30%), indicating the best balance between precision and recall: they neither miss too many real cases (which high recall prevents) nor produce too many false positives (which high precision prevents).
Best ROC AUC Score: Gradient Boosting with Resampling and Gradient Boosting with Scaled Resampling achieve the highest ROC AUC scores (64.10%), indicating a strong ability to distinguish between patients with and without heart disease.
Best Computational Time: Logistic Regression with Scaled Resampling is the fastest model, making it the best choice when quick predictions are needed.
For heart disease prediction, it is essential to select a model that not only has high accuracy but also a strong ability to correctly identify as many actual cases as possible (high recall) and correctly predict heart disease when it is truly present (high precision). Additionally, the ability to distinguish between the classes (high ROC AUC) and a good balance between precision and recall (high F1-score) are particularly important.
Considering the criticality of all these metrics in a healthcare context, the Gradient Boosting with Resampling and Gradient Boosting with Scaled Resampling models are the best choices. They not only provide the highest precision and F1-scores, indicating a robust balance between precision and recall, but also the highest ROC AUC scores, demonstrating excellent discriminative ability. However, we will select only Gradient Boosting with Resampling for further analysis, since scaling did not meaningfully change the resampled model's performance.
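The selection procedure above can be sketched in code. This is a minimal illustration, not the authors' exact pipeline: a synthetic imbalanced dataset stands in for the cardiovascular data, and simple random oversampling of the minority class stands in for the report's resampling step.

```python
# Sketch: train Gradient Boosting on resampled data and report the
# selection metrics discussed above. Synthetic data and random
# oversampling are assumptions, not the original pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.utils import resample

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Oversample the minority class in the training split only, so the
# test split keeps the original class balance.
minority = y_tr == 1
X_min_up, y_min_up = resample(X_tr[minority], y_tr[minority],
                              n_samples=int((~minority).sum()),
                              random_state=42)
X_bal = np.vstack([X_tr[~minority], X_min_up])
y_bal = np.concatenate([y_tr[~minority], y_min_up])

model = GradientBoostingClassifier(random_state=42).fit(X_bal, y_bal)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]  # probabilities for ROC AUC
scores = {"accuracy": accuracy_score(y_te, pred),
          "precision": precision_score(y_te, pred),
          "recall": recall_score(y_te, pred),
          "f1": f1_score(y_te, pred),
          "roc_auc": roc_auc_score(y_te, proba)}
print(scores)
```

Note that resampling is applied after the train/test split; oversampling before splitting would leak duplicated minority samples into the test set and inflate the scores.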
4.4.2 Best Model Performance
Now, we want to see how the Gradient Boosting models with all variables and with resampled data perform with and without cross-validation. Are the results consistent, or do they vary?
| | Model | Accuracy | Precision | Recall | F1-score | ROC AUC Score | Computational Time |
|---|---|---|---|---|---|---|---|
| 0 | Gradient Boosting with All Variables | 0.919436 | 0.498660 | 0.037621 | 0.069964 | 0.517154 | 19.295390 |
| 1 | Gradient Boosting with All Variables (CV) | 0.919611 | 0.536702 | 0.047187 | 0.086730 | 0.836876 | 74.838946 |
| 2 | Gradient Boosting with Resampling | 0.823757 | 0.607130 | 0.336492 | 0.433000 | 0.641030 | 8.050771 |
| 3 | Gradient Boosting with Resampling (CV) | 0.822699 | 0.604184 | 0.329218 | 0.426168 | 0.835305 | 31.133341 |
Overall, both Gradient Boosting models show broadly consistent performance between the original and cross-validated versions. The cross-validated all-variables model gains slightly in precision, recall, and F1-score, while the resampled model's scores dip marginally; the cross-validated ROC AUC scores, however, are markedly higher, suggesting better generalization and robustness. This improvement comes with the trade-off of substantially increased computational time.
When a model produces similar results with and without cross-validation, it indicates that the model’s performance is consistent and not heavily influenced by how the data is split for validation. This suggests that the model is robust and capable of generalizing effectively to unseen data.
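The comparison above can be reproduced with scikit-learn's `cross_validate`, which scores every fold on several metrics at once. This is a hedged sketch: synthetic data stands in for the cardiovascular dataset, and 5 folds is an assumption, since the report does not state the number of folds used.

```python
# Sketch: k-fold cross-validation with the same metrics as the table
# above. Synthetic data and cv=5 are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
cv_results = cross_validate(
    GradientBoostingClassifier(random_state=0), X, y, cv=5,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])

# Averaging over folds gives one score per metric, comparable to a
# single hold-out evaluation but less sensitive to one particular split.
for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
    print(metric, cv_results[f"test_{metric}"].mean())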
4.5 Cardiovascular: Additional Techniques
4.5.1 Learning Curve
After evaluating the performance of various models for heart disease prediction, it is clear that Gradient Boosting models, especially those with resampling techniques, outperform others. Now, for the best model performance, we want to see the learning curve of this model.
The learning curve plot, as shown in Cardio Figure 3, reveals that the test score plateaus around the 100,000 to 125,000 samples mark. A training set size within this range would therefore be appropriate and efficient for this particular Gradient Boosting model, balancing model performance against computational cost; beyond this range, the benefit of additional samples diminishes.
The learning curve plot, as shown in Cardio Figure 4, reveals that about 40,000 samples is sufficient: beyond this point, the test-score improvement is marginal, indicating that additional samples are unlikely to significantly improve performance on unseen data. A training set size of roughly 40,000 samples is therefore optimal for the Gradient Boosting model with resampling.
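Curves like those in Cardio Figures 3 and 4 can be generated with scikit-learn's `learning_curve`. The sketch below uses synthetic stand-in data and an assumed train-size grid; the figure-specific sample counts come from the actual dataset, not this example.

```python
# Sketch: compute a learning curve (train vs. test score as the
# training set grows). Synthetic data and the size grid are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
train_sizes, train_scores, test_scores = learning_curve(
    GradientBoostingClassifier(random_state=0), X, y, cv=3,
    train_sizes=np.linspace(0.1, 1.0, 5))

# The point where the test curve flattens while the train curve stays
# high marks the size beyond which extra samples stop paying off.
for n, tr, te in zip(train_sizes, train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print(n, round(tr, 3), round(te, 3))
```

Plotting the two mean curves against `train_sizes` reproduces the kind of plateau read off in the figures above.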
4.5.2 Checking for Overfitting
We aim to assess overfitting in Gradient Boosting models trained on all variables and resampled data by experimenting with different hyperparameter settings. Specifically, we will vary the maximum depth from 1 to 20 and fix the number of estimators at 50, 100, and 150. By exploring these configurations, we seek to analyze how the model’s performance evolves with varying complexities. This investigation will allow us to determine the optimal trade-off between model complexity and generalization capacity.
4.5.2.1 Checking for Overfitting: Gradient Boosting with All Variables
4.5.2.1.1 Gradient Boosting Classifier with 50 estimators
4.5.2.1.2 Gradient Boosting Classifier with 100 estimators
4.5.2.1.3 Gradient Boosting Classifier with 150 estimators
In our investigation of various combinations of max depth and number of estimators for the Gradient Boosting Classifier, shown in Cardio Figure 5a), 5b), and 5c), we observed a consistent trend: increasing the max depth generally improved performance metrics on the training set, including accuracy, precision, recall, F1-score, and ROC AUC score. However, performance on the testing dataset exhibited fluctuations, with certain max depths performing better than others. These observations suggest a potential risk of overfitting.
The optimal max depth tends to fall within the range of 10 to 12, striking a balance between model complexity and generalization ability across different configurations of n_estimators (50, 100, and 150). This range consistently delivers good performance on both the training and testing datasets while mitigating the risk of overfitting.
Interestingly, the choice of n_estimators does not significantly alter the observed trends. Although higher values may offer slightly better performance, the overall behavior of the model, as reflected in these curves, remains consistent.
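The sweep described above, varying `max_depth` at a fixed number of estimators and comparing train against test scores, maps directly onto scikit-learn's `validation_curve`. This sketch uses synthetic stand-in data and a reduced depth grid (1 to 6 rather than 1 to 20) to keep it fast; the grid size is an assumption of this example only.

```python
# Sketch of the overfitting check: vary max_depth with n_estimators
# fixed at 50 and compare train vs. test accuracy per depth.
# Synthetic data and the shortened depth grid are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
depths = [1, 2, 3, 4, 5, 6]
train_scores, test_scores = validation_curve(
    GradientBoostingClassifier(n_estimators=50, random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=3)

# A train score that keeps climbing while the test score stalls or
# drops, i.e. a widening gap, is the overfitting signature above.
for d, tr, te in zip(depths, train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print(d, round(tr, 3), round(te, 3))
```

Repeating the sweep with `n_estimators=100` and `150` reproduces the three panel comparisons in Cardio Figure 5.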
4.5.2.2 Checking for Overfitting of Gradient Boosting with Resampling Data
4.5.2.2.1 Gradient Boosting Classifier (Resampling) with 50 estimators
4.5.2.2.2 Gradient Boosting Classifier (Resampling) with 100 estimators
4.5.2.2.3 Gradient Boosting Classifier (Resampling) with 150 estimators
In our investigation of various combinations of max depth and number of estimators for the Gradient Boosting Classifier with Resampling, shown in Cardio Figure 6a), 6b), and 6c), we observed that increasing the max depth generally improves performance on the training set but may lead to overfitting on the testing set. The optimal max depth appears to be around 12, where a good balance between model complexity and generalization is achieved across different configurations of n_estimators.
5 References
Scikit-learn. (2009). sklearn.ensemble.GradientBoostingClassifier — scikit-learn 0.20.3 documentation. Scikit-Learn.org. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
Alves, L. M. (2021, July 2). KNN (K Nearest Neighbors) and KNeighborsClassifier — What it is, how it works, and a practical… Medium. https://luis-miguel-code.medium.com/knn-k-nearest-neighbors-and-kneighborsclassifier-what-it-is-how-it-works-and-a-practical-914ec089e467
Giola, C., Danti, P., & Magnani, S. (2021, July 13). Learning curves: A novel approach for robustness improvement of load forecasting. MDPI. https://www.mdpi.com/2673-4591/5/1/38#metrics
IBM. (2022). What Is Logistic Regression? IBM. https://www.ibm.com/topics/logistic-regression
IBM. (2023a). What is a Decision Tree? IBM. https://www.ibm.com/topics/decision-trees
IBM. (2023b). What is Random Forest? IBM. https://www.ibm.com/topics/random-forest
Brownlee, J. (2018, November 20). A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning. Machine Learning Mastery. https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
Nair, R., & Bhagat, A. (2019, April 6). Feature Selection Method To Improve The Accuracy of Classification Algorithm. International Journal of Soft Computing and Engineering. https://www.ijitee.org/wp-content/uploads/papers/v8i6/F3421048619.pdf
Snieder, E., Abogadil, K., & T. Khan, U. (2020). Resampling and ensemble techniques for improving ANN-based high flow forecast accuracy. Department of Civil Engineering, York University. https://hess.copernicus.org/preprints/hess-2020-430/hess-2020-430-manuscript-version4.pdf
Scikit-learn. (2018). sklearn.ensemble.RandomForestClassifier — scikit-learn 0.20.3 documentation. Scikit-Learn.org. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Wizards, D. S. (2023, July 7). Understanding the AdaBoost Algorithm. Medium. https://medium.com/@datasciencewizards/understanding-the-adaboost-algorithm-2e9344d83d9b
Sun, Y., Wong, A. K. C., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687-719. https://www.researchgate.net/publication/263913891_Classification_of_imbalanced_data_a_review
Muralidhar, K. S. V. (2023, July 7). Learning Curve to identify Overfitting and Underfitting in Machine Learning. Medium. https://towardsdatascience.com/learning-curve-to-identify-overfitting-underfitting-problems-133177f38df5#:~:text=Learning%20curve%20of%20an%20overfit%20model%20has%20a%20very%20low
Programmer, P. (2023, May 17). Evaluation Metrics for Classification. Medium. https://medium.com/@impythonprogrammer/evaluation-metrics-for-classification-fc770511052d
What is Overfitting? - Overfitting in Machine Learning Explained - AWS. (n.d.). Amazon Web Services, Inc. Retrieved May 31, 2024, from https://aws.amazon.com/what-is/overfitting/#:~:text=Underfitting%20vs
The links for the three datasets are listed below, as hyperlinks: